Exploration of Projection Spaces¶

In [16]:
# Feel free to add dependencies, but make sure that they are included in environment.yml

#disable some annoying warnings
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

#plots the figures in place instead of a new window
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import altair as alt
from altair import datum
alt.data_transformers.disable_max_rows()

from sklearn import manifold
# Alias the openTSNE class so it is not shadowed by scikit-learn's TSNE imported below
from openTSNE import TSNE as OpenTSNE
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
In [17]:
# openpyxl is required to read Excel files; it is also listed in the .yml file

#pip install openpyxl

Data¶

To be able to explore paths in a projected space, you need to pick a problem/algorithm/model that consists of multiple states that change iteratively.


An example is the solving of a Rubik's Cube. After each rotation the state of the cube changes. This results in a path from the initial state, through the individual rotations, to the solved cube. By using projection, we can examine the individual states and paths in the two-dimensional space. Depending on the initial state and the solution strategy the paths will differ or resemble each other.

This is an example of solving 10 randomly scrambled Rubik's Cubes with two different strategies, the Beginner (in green) and the Fridrich Method (in orange):

[Figure: Rubik's Cube Solving Strategies]
You can see that although each cube is scrambled differently in the beginning, both strategies converge to the same paths after a few steps. You can also notice that the Beginner's method takes some additional paths that are not necessary with the Fridrich method.

Read and Prepare Data¶

Read in your data from a file or create your own data.

Document any data processing steps.

In [18]:
import os
import re
import pandas as pd

# Define board coordinates
board_size = 19
coordinates = [chr(97 + row) + chr(97 + col) for row in range(board_size) for col in range(board_size)]

# Define columns for the final DataFrame
columns = ["game_id", "move_id", "color", "winner_color", "winner_score", "result", "rules", "handicap", "starter_player", "step_count"] + coordinates

# Initialize an empty DataFrame
all_games_df = pd.DataFrame(columns=columns)

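# Helper functions
def parse_metadata_and_moves(sgf_content):
    """Extract game metadata and the move list from the text of one SGF file."""
    # Rules (RU) and handicap (HA) are optional SGF properties
    rules_match = re.search(r"RU\[([^\]]*)\]", sgf_content)
    rules = rules_match.group(1) if rules_match else "Unknown"
    handicap_match = re.search(r"HA\[(\d+)\]", sgf_content)
    handicap = int(handicap_match.group(1)) if handicap_match else 0

    # Winner information from the result property, e.g. RE[B+12.5] or RE[W+Resign]
    result_match = re.search(r"RE\[([^\]]*)\]", sgf_content)
    result_text = result_match.group(1) if result_match else ""
    winner_color = "black" if result_text.startswith("B+") else "white"
    # Note: every result that is not a resignation (including a win on time) is
    # labelled "score", so winner_score can hold text such as "Time"
    result = "resign" if "Resign" in sgf_content else "score"
    winner_score = None
    if result == "score" and result_match:
        winner_score = result_text[2:]  # Drop the leading "B+" / "W+"

    # Extract moves: nodes of the form ;B[xy] or ;W[xy]; a pass has an empty position
    moves = []
    move_nodes = re.findall(r";\s*([BW])\[([a-z]{2})?\]", sgf_content)
    for move_id, (player, pos) in enumerate(move_nodes, start=1):
        color = "black" if player == "B" else "white"
        moves.append((move_id, color, pos))

    # Determine starter player based on the first move
    starter_player = moves[0][1] if moves else "unknown"  # "unknown" if there are no moves
    return moves, winner_color, winner_score, result, rules, handicap, starter_player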

# Initialize board with all 0s
def initialize_board():
    return {coord: 0 for coord in coordinates}

# Apply moves and log each step for a single game
def process_game(game_id, moves, winner_color, winner_score, result, rules, handicap, starter_player):
    """Return one row per move: the game metadata plus the board columns for that move."""
    board_state = initialize_board()
    data = []
    step_count = len(moves)

    for move_id, color, pos in moves:
        # Mark the stone placed in the current move
        board_state[pos] = 2 if color == "black" else 1  # 2 for Black, 1 for White
        row_data = {
            "game_id": game_id,
            "move_id": move_id,
            "color": color,
            "winner_color": winner_color,
            "winner_score": winner_score,
            "result": result,
            "rules": rules,
            "handicap": handicap,
            "starter_player": starter_player,
            "step_count": step_count
        }
        row_data.update(board_state)
        data.append(row_data)
        # Clear the stone again, so each row contains only the stone placed in that move
        board_state[pos] = 0
    return data

# Parse and process multiple SGF files in a folder
def process_sgf_folder(folder_path):
    global all_games_df
    game_files = []

    # Gather files with their game IDs
    for filename in os.listdir(folder_path):
        if filename.endswith(".sgf"):
            # Extract the game ID from the filename using regex
            match = re.match(r"(\d+)_", filename)
            if match:
                game_id = int(match.group(1))
                game_files.append((game_id, filename))

    # Sort files by game_id
    game_files.sort(key=lambda x: x[0])

    # Process each file in sorted order and collect one DataFrame per game
    game_frames = []
    for game_id, filename in game_files:
        with open(os.path.join(folder_path, filename), "r", encoding="utf-8") as file:
            sgf_content = file.read()
        moves, winner_color, winner_score, result, rules, handicap, starter_player = parse_metadata_and_moves(sgf_content)
        game_data = process_game(game_id, moves, winner_color, winner_score, result, rules, handicap, starter_player)
        game_frames.append(pd.DataFrame(game_data, columns=columns))

    # Concatenate once at the end instead of growing the DataFrame inside the loop
    if game_frames:
        all_games_df = pd.concat([all_games_df] + game_frames, ignore_index=True)

# Specify the folder that contains the SGF files
folder_path = os.path.join("final", "SGF")  # Update with the path to your folder

# Process the SGF files (commented out: we don't have to run it again)
#process_sgf_folder(folder_path)

# Save the merged DataFrame to an Excel file
output_path = "merged_games_data_sorted_final_new.xlsx"
#all_games_df.to_excel(output_path, index=False)

#print(f"Data saved to {output_path}")

# Disabled step (kept for reference): merge the score table into the game data
# # Load the score table
# score_table_path = os.path.join("final", "Scores_key.xlsx")  # Update with your score table path
# score_df = pd.read_excel(score_table_path)
#
# # Rename the 'Id' column in the score table to 'game_id' to match the main DataFrame
# score_df = score_df.rename(columns={"Id": "game_id"})
#
# # Merge the score table with all_games_df on 'game_id'
# merged_df = pd.merge(all_games_df, score_df, on="game_id", how="left")
#
# # Save the merged DataFrame to an Excel file
# output_path = "merged_games_with_scores_final_new.xlsx"
# merged_df.to_excel(output_path, index=False)

Comments¶

  • Did you transform, clean, or extend the data? How/Why?

We downloaded Go games as .sgf files, read them from a folder, and used string functions to extract metadata and the sequence of moves (board states) of each game. Examples of metadata features: player colors, rules, handicap, time, result type, etc.

We also used a helper table (GOgame\Scores_key.xlsx) with the area and territory scores, which we obtained by viewing the .sgf files on this webpage: https://speedtesting.herokuapp.com/sgfviewer/#google_vignette. We added some categories as well, for example whether the player is at beginner, intermediate, or master level.

Thresholds:

  • 40 - 80: Beginner
  • 80 - 200: Intermediate
  • 200 and above: Master
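
The thresholds above can be sketched as a small helper. This is only a minimal sketch, not the code used to build the score table: the function name `level_from_value` is hypothetical, the text does not say which quantity is thresholded, and the handling of the shared boundaries (80 and 200) and of values below 40 is an assumption.

```python
def level_from_value(value):
    """Map a numeric value to a level using the thresholds listed above.

    Assumption: intervals are half-open, so 80 counts as Intermediate
    and 200 counts as Master. Values below 40 are not covered by the
    thresholds, so None is returned for them.
    """
    if value < 40:
        return None
    if value < 80:
        return "Beginner"
    if value < 200:
        return "Intermediate"
    return "Master"


for example in (50, 80, 150, 200):
    print(example, level_from_value(example))
```

If the boundaries were meant to be inclusive on the upper side instead, only the comparison operators would need to change.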

Then we merged these two tables into one, based on game_id.

Final table: GOgame\merged_games_with_scores_final.xlsx

Projection¶

Project your data into a 2D space. Try multiple (3+) projection methods (e.g., t-SNE, UMAP, MDS, PCA, ICA, other methods) with different settings and compare them.

Make sure that all additional dependencies are included when submitting.

In [ ]:
# Load dataset
data = pd.read_excel('/GOgame/merged_games_with_scores_final_new.xlsx')

# Select only the board state columns ('aa' ... 'ss') for dimensionality reduction
board_state_columns = data.loc[:, 'aa':'ss']

# All remaining columns hold the game and player metadata
info_cols = [col for col in data.columns if col not in board_state_columns.columns]

# player_info_columns for merged_games_with_scores_final_new.xlsx
# (copied so that later column assignments do not trigger a SettingWithCopyWarning)
player_info_columns = data[info_cols].copy()

Here you can see the metadata features that we collected.

In [20]:
player_info_columns.head()
Out[20]:
game_id move_id color winner_color winner_score result rules handicap starter_player step_count ... Date Our_players_colour Area score Area score of opponent Area_winner_color Area_result Territory_score Territory_score_of_opponent Territory_winner_color Territory_result
0 1 1 white white Time score Japanese 4 white 105 ... September, 2024 W 54.5 56.0 B 1,5 4.5 3.0 W 1,5
1 1 2 black white Time score Japanese 4 white 105 ... September, 2024 W 54.5 56.0 B 1,5 4.5 3.0 W 1,5
2 1 3 white white Time score Japanese 4 white 105 ... September, 2024 W 54.5 56.0 B 1,5 4.5 3.0 W 1,5
3 1 4 black white Time score Japanese 4 white 105 ... September, 2024 W 54.5 56.0 B 1,5 4.5 3.0 W 1,5
4 1 5 white white Time score Japanese 4 white 105 ... September, 2024 W 54.5 56.0 B 1,5 4.5 3.0 W 1,5

5 rows × 23 columns

The board state columns at each move. Each row marks only the position of the stone placed in that move (2 for Black, 1 for White); all other positions are 0:
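
As a rough illustration of this encoding, the following sketch builds the 361 board values for a single move. The helper `encode_move` and the example move are hypothetical; only the coordinate order and the values 2 (Black) and 1 (White) are taken from the data preparation code above.

```python
BOARD_SIZE = 19
# Same column order as in the data preparation step: 'aa', 'ab', ..., 'ss'
COORDINATES = [chr(97 + row) + chr(97 + col)
               for row in range(BOARD_SIZE) for col in range(BOARD_SIZE)]


def encode_move(color, pos):
    """Return the 361 board values for a single move (hypothetical helper)."""
    row = dict.fromkeys(COORDINATES, 0)
    row[pos] = 2 if color == "black" else 1  # 2 for Black, 1 for White
    return [row[coord] for coord in COORDINATES]


vector = encode_move("black", "pd")
print(len(vector))                    # number of board columns
print(COORDINATES[vector.index(2)])   # position of the only non-zero entry
```

Because every row has a single non-zero entry, the rows resemble a one-hot encoding of the move position rather than a cumulative board position.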

In [21]:
board_state_columns
Out[21]:
aa ab ac ad ae af ag ah ai aj ... sj sk sl sm sn so sp sq sr ss
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9750 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9751 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9752 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9753 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9754 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

9755 rows × 361 columns

PCA¶

In [22]:
# Standardize the board state data
scaler = StandardScaler()
board_state_scaled = scaler.fit_transform(board_state_columns)

# Apply PCA to the board state columns
pca = PCA()
board_state_pca = pca.fit_transform(board_state_scaled)

# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components (Board State)')
plt.grid(True)
plt.show()

# Checking how many components explain a significant amount of variance
explained_variance = pca.explained_variance_ratio_

# Summary of PCA
print(f"Explained variance by each component: {explained_variance}")
print(f"Total explained variance: {np.cumsum(explained_variance)}")
[Figure: Explained Variance by Principal Components (Board State)]
Explained variance by each component: [0.00278106 0.00278098 0.00278027 0.00277979 0.00277972 0.00277971
 0.00277958 0.00277955 0.00277954 0.00277947 0.00277943 0.00277942
 0.00277935 0.00277931 0.00277928 0.00277927 0.00277925 0.00277924
 0.00277922 0.0027792  0.00277911 0.00277907 0.00277907 0.00277904
 0.00277901 0.002779   0.00277899 0.00277898 0.00277895 0.00277894
 0.00277892 0.00277891 0.00277888 0.00277888 0.00277887 0.00277884
 0.00277884 0.00277883 0.0027788  0.0027788  0.0027788  0.00277879
 0.00277878 0.00277877 0.00277876 0.00277874 0.00277874 0.00277874
 0.00277873 0.00277872 0.00277871 0.0027787  0.00277869 0.00277869
 0.00277864 0.00277859 0.00277856 0.00277856 0.00277854 0.00277845
 0.00277845 0.00277844 0.00277844 0.00277843 0.0027784  0.00277836
 0.00277836 0.00277836 0.00277835 0.00277832 0.00277832 0.00277832
 0.00277832 0.0027783  0.00277829 0.00277829 0.00277827 0.00277825
 0.00277822 0.00277821 0.0027782  0.00277819 0.00277817 0.00277816
 0.00277813 0.00277813 0.00277812 0.00277808 0.00277808 0.00277807
 0.00277805 0.00277805 0.00277805 0.00277803 0.00277801 0.00277801
 0.002778   0.00277799 0.00277798 0.00277798 0.00277797 0.00277796
 0.00277796 0.00277796 0.00277795 0.00277794 0.00277794 0.00277794
 0.00277793 0.00277791 0.00277785 0.00277785 0.00277785 0.00277785
 0.00277785 0.00277785 0.00277783 0.00277781 0.00277781 0.00277781
 0.00277781 0.0027778  0.00277777 0.00277777 0.00277777 0.00277777
 0.00277777 0.00277777 0.00277777 0.00277775 0.00277774 0.00277774
 0.00277773 0.00277771 0.00277771 0.00277771 0.00277771 0.0027777
 0.00277769 0.00277769 0.00277768 0.00277768 0.00277767 0.00277766
 0.00277766 0.00277766 0.00277766 0.00277763 0.00277761 0.00277761
 0.00277759 0.00277757 0.00277757 0.00277757 0.00277755 0.00277754
 0.00277753 0.00277753 0.00277751 0.0027775  0.00277749 0.00277747
 0.00277747 0.00277747 0.00277747 0.00277747 0.00277746 0.00277745
 0.00277744 0.00277744 0.00277743 0.00277743 0.00277743 0.00277743
 0.00277743 0.00277743 0.00277742 0.00277739 0.00277738 0.00277738
 0.00277738 0.00277734 0.00277734 0.00277732 0.00277731 0.00277729
 0.00277729 0.00277729 0.00277727 0.00277726 0.00277724 0.00277723
 0.00277723 0.0027772  0.00277718 0.00277718 0.00277718 0.00277718
 0.00277716 0.00277715 0.00277712 0.0027771  0.0027771  0.00277709
 0.00277707 0.00277706 0.00277704 0.00277702 0.00277702 0.002777
 0.00277698 0.00277698 0.00277698 0.00277698 0.00277697 0.00277696
 0.00277696 0.00277695 0.00277693 0.00277693 0.00277693 0.00277693
 0.00277692 0.00277692 0.00277692 0.00277692 0.00277692 0.00277692
 0.00277687 0.00277683 0.0027768  0.00277678 0.00277678 0.00277678
 0.00277676 0.00277674 0.00277674 0.00277674 0.00277672 0.00277671
 0.00277671 0.0027767  0.00277669 0.00277669 0.00277669 0.00277669
 0.00277668 0.00277666 0.0027766  0.00277659 0.00277656 0.00277654
 0.00277654 0.00277654 0.00277654 0.00277651 0.0027765  0.00277648
 0.00277647 0.00277647 0.00277646 0.00277645 0.00277644 0.00277643
 0.00277642 0.00277641 0.00277641 0.00277631 0.0027763  0.00277625
 0.00277623 0.00277621 0.0027762  0.00277619 0.00277618 0.00277618
 0.00277616 0.00277613 0.00277609 0.00277608 0.00277608 0.00277603
 0.00277599 0.00277592 0.00277577 0.00277575 0.00277575 0.00277568
 0.00277565 0.00277549 0.00277547 0.00277545 0.00277541 0.00277522
 0.00277519 0.00277517 0.00277516 0.00277515 0.00277514 0.00277511
 0.00277506 0.00277501 0.00277497 0.00277491 0.00277489 0.00277483
 0.00277479 0.00277477 0.00277474 0.00277473 0.0027747  0.00277467
 0.00277466 0.00277465 0.00277464 0.00277464 0.00277455 0.0027745
 0.00277442 0.00277439 0.00277429 0.00277426 0.00277426 0.00277423
 0.00277422 0.00277422 0.00277422 0.00277419 0.00277418 0.00277415
 0.00277415 0.00277399 0.00277395 0.00277394 0.00277392 0.00277391
 0.00277391 0.00277389 0.00277388 0.00277387 0.00277372 0.0027737
 0.0027737  0.00277364 0.00277363 0.00277344 0.00277343 0.0027734
 0.00277339 0.00277325 0.00277324 0.00277316 0.00277313 0.00277302
 0.00277265 0.00277246 0.00277195 0.00277187 0.00277186 0.00277138
 0.00028717]
Total explained variance: [0.00278106 0.00556204 0.00834232 0.01112211 0.01390183 0.01668154
 0.01946112 0.02224068 0.02502022 0.02779969 0.03057913 0.03335855
 0.0361379  0.03891721 0.04169649 0.04447576 0.04725501 0.05003426
 0.05281348 0.05559267 0.05837178 0.06115086 0.06392993 0.06670896
 0.06948797 0.07226698 0.07504596 0.07782494 0.08060389 0.08338284
 0.08616176 0.08894067 0.09171955 0.09449842 0.09727729 0.10005613
 0.10283496 0.10561379 0.10839259 0.11117139 0.11395019 0.11672898
 0.11950776 0.12228653 0.12506529 0.12784403 0.13062277 0.1334015
 0.13618023 0.13895895 0.14173766 0.14451636 0.14729505 0.15007374
 0.15285239 0.15563098 0.15840954 0.1611881  0.16396664 0.16674509
 0.16952354 0.17230198 0.17508043 0.17785885 0.18063726 0.18341562
 0.18619398 0.18897234 0.19175069 0.19452902 0.19730734 0.20008566
 0.20286398 0.20564229 0.20842057 0.21119886 0.21397713 0.21675537
 0.21953359 0.22231181 0.22509001 0.2278682  0.23064637 0.23342453
 0.23620266 0.23898079 0.2417589  0.24453699 0.24731507 0.25009315
 0.2528712  0.25564924 0.25842729 0.26120532 0.26398333 0.26676135
 0.26953935 0.27231734 0.27509532 0.27787331 0.28065128 0.28342924
 0.2862072  0.28898516 0.29176311 0.29454105 0.29731899 0.30009694
 0.30287487 0.30565277 0.30843062 0.31120847 0.31398632 0.31676417
 0.31954202 0.32231987 0.3250977  0.32787551 0.33065332 0.33343112
 0.33620893 0.33898673 0.3417645  0.34454227 0.34732004 0.35009781
 0.35287558 0.35565336 0.35843113 0.36120888 0.36398662 0.36676436
 0.36954209 0.3723198  0.37509751 0.37787523 0.38065294 0.38343064
 0.38620833 0.38898602 0.3917637  0.39454138 0.39731905 0.40009671
 0.40287438 0.40565204 0.4084297  0.41120733 0.41398494 0.41676255
 0.41954015 0.42231772 0.42509529 0.42787286 0.43065041 0.43342795
 0.43620548 0.43898301 0.44176053 0.44453802 0.44731551 0.45009298
 0.45287045 0.45564792 0.45842539 0.46120286 0.46398032 0.46675776
 0.4695352  0.47231264 0.47509007 0.4778675  0.48064493 0.48342236
 0.48619979 0.48897722 0.49175465 0.49453204 0.49730942 0.5000868
 0.50286418 0.50564153 0.50841886 0.51119618 0.51397349 0.51675078
 0.51952808 0.52230537 0.52508264 0.52785989 0.53063713 0.53341436
 0.53619158 0.53896878 0.54174596 0.54452315 0.54730033 0.55007751
 0.55285466 0.55563181 0.55840893 0.56118603 0.56396313 0.56674023
 0.56951729 0.57229435 0.57507139 0.57784841 0.58062543 0.58340243
 0.58617941 0.5889564  0.59173338 0.59451036 0.59728732 0.60006428
 0.60284124 0.60561818 0.60839511 0.61117205 0.61394898 0.61672591
 0.61950283 0.62227975 0.62505667 0.62783359 0.63061051 0.63338743
 0.6361643  0.63894113 0.64171793 0.64449471 0.64727149 0.65004827
 0.65282503 0.65560177 0.65837852 0.66115526 0.66393199 0.6667087
 0.66948541 0.67226211 0.6750388  0.67781548 0.68059217 0.68336886
 0.68614554 0.6889222  0.6916988  0.69447539 0.69725195 0.7000285
 0.70280504 0.70558159 0.70835813 0.71113464 0.71391115 0.71668763
 0.7194641  0.72224057 0.72501704 0.72779349 0.73056993 0.73334636
 0.73612278 0.73889919 0.7416756  0.74445192 0.74722822 0.75000446
 0.75278069 0.7555569  0.7583331  0.76110929 0.76388547 0.76666165
 0.76943781 0.77221394 0.77499003 0.7777661  0.78054218 0.78331821
 0.78609421 0.78887012 0.79164589 0.79442165 0.7971974  0.79997308
 0.80274873 0.80552422 0.80829969 0.81107514 0.81385055 0.81662577
 0.81940096 0.82217613 0.82495129 0.82772643 0.83050158 0.83327669
 0.83605175 0.83882676 0.84160173 0.84437664 0.84715154 0.84992637
 0.85270115 0.85547593 0.85825067 0.8610254  0.8638001  0.86657477
 0.86934942 0.87212407 0.87489871 0.87767335 0.8804479  0.8832224
 0.88599682 0.88877121 0.8915455  0.89431976 0.89709402 0.89986825
 0.90264247 0.90541669 0.9081909  0.91096509 0.91373927 0.91651342
 0.91928757 0.92206156 0.92483551 0.92760945 0.93038337 0.93315728
 0.93593119 0.93870507 0.94147895 0.94425282 0.94702654 0.94980024
 0.95257394 0.95534759 0.95812121 0.96089465 0.96366808 0.96644148
 0.96921487 0.97198812 0.97476136 0.97753452 0.98030765 0.98308067
 0.98585332 0.98862577 0.99139772 0.99416958 0.99694144 0.99971283
 1.        ]

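PCA is not useful on the board states: every principal component explains almost the same share of the variance (about 0.278%, roughly 1/360), so no small set of components summarizes the data. This is plausible, because each row contains only the single stone placed in that move, so no feature (in this case board position) is more important than the others.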
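
The flat spectrum can be reproduced on synthetic data. The sketch below is an illustration under assumptions, not part of the original analysis: the sizes are hypothetical and much smaller than the real table, the stone positions are drawn uniformly at random, and PCA is computed directly with NumPy from the eigenvalues of the covariance matrix of the standardized columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 5000, 25  # hypothetical sizes, much smaller than the real 9755 x 361

# Each row holds exactly one stone (1 or 2) at a random position; all other entries are 0
X = np.zeros((n_rows, n_cols))
positions = rng.integers(0, n_cols, size=n_rows)
X[np.arange(n_rows), positions] = rng.integers(1, 3, size=n_rows)

# Standardize the columns, then take the eigenvalues of the covariance matrix
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(X_scaled, rowvar=False))[::-1]
explained_variance_ratio = eigenvalues / eigenvalues.sum()

print("largest ratio:", explained_variance_ratio[0])
print("uniform share:", 1 / n_cols)
```

In this setting the largest ratio stays close to the uniform share 1 / n_cols, which mirrors the nearly constant values printed above.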

PCA for the player info columns

In [23]:
# Note: winner_score and Rank are excluded here
# Feature names in merged_games_with_scores_final_new.xlsx
numerical_features = ['handicap', 'step_count', 
                      'Area score', 'Area score of opponent', 'Area_result', 'Territory_score', 
                      'Territory_score_of_opponent', 'Territory_result']
categorical_features = ['game_id', 'move_id', 'color', 
                        'winner_color', 'result', 'rules', 'starter_player', 
                        'level', 'Player', 'Our_players_colour', 'Area_winner_color', 'Territory_winner_color']

# Convert numerical features to numeric (int/float); values that cannot be parsed become NaN
# (pd.to_numeric does not parse decimal commas, so strings such as "1,5" also become NaN)
for feature in numerical_features:
    player_info_columns[feature] = pd.to_numeric(player_info_columns[feature], errors='coerce')

# Fill the NaN values that result from the conversion with the mean of the column
player_info_columns[numerical_features] = player_info_columns[numerical_features].fillna(
    player_info_columns[numerical_features].mean()
)
In [24]:
# Define preprocessor with dense output for OneHotEncoder
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(sparse_output=False), categorical_features)
    ]
)

# Create a pipeline for PCA
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', PCA())
])

# Apply PCA to the player info columns
player_info_pca = pipeline.fit_transform(player_info_columns)

# Extract PCA component and explained variance
pca = pipeline.named_steps['pca']
explained_variance = pca.explained_variance_ratio_

# Get feature names from the preprocessor
onehot_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(onehot_feature_names)

# Analyze the PCA components to see the most important features
pca_components = pca.components_

# Print out the most important features for each principal component
print("\nMost important features for each principal component:\n")

for i, component in enumerate(pca_components[:10]):
    important_features = sorted(zip(all_feature_names, component), key=lambda x: abs(x[1]), reverse=True)
    print(f"Principal Component {i + 1}:")
    for feature, loading in important_features[:10]:  # Print top 10 features for each component
        print(f"  {feature}: {loading:.4f}")
    print()
Most important features for each principal component:

Principal Component 1:
  Territory_score_of_opponent: 0.4368
  Area score of opponent: 0.4179
  Area score: -0.4079
  Territory_result: 0.3632
  Area_result: 0.3628
  Territory_score: -0.3218
  level_1: 0.1539
  step_count: 0.1083
  level_3: -0.0993
  winner_color_black: 0.0784

Principal Component 2:
  handicap: -0.4796
  Territory_score: 0.4090
  Territory_result: 0.4058
  Area_result: 0.4036
  Area score: 0.3122
  Area score of opponent: -0.1736
  starter_player_black: 0.1335
  starter_player_white: -0.1335
  result_resign: -0.1215
  result_score: 0.1215

Principal Component 3:
  step_count: -0.6448
  Area score of opponent: -0.2564
  Area_winner_color_B: 0.2239
  Area_winner_color_W: -0.2239
  Territory_winner_color_B: 0.2060
  Territory_winner_color_W: -0.2060
  starter_player_black: -0.1937
  starter_player_white: 0.1937
  rules_Japanese: 0.1803
  Player_mangochia: -0.1795

Principal Component 4:
  handicap: -0.5985
  starter_player_black: 0.3295
  starter_player_white: -0.3295
  step_count: -0.2515
  Territory_score: -0.2351
  Area score: -0.2326
  Territory_result: -0.2272
  Area_result: -0.2236
  result_resign: 0.1600
  result_score: -0.1600

Principal Component 5:
  step_count: -0.3695
  Area_winner_color_W: 0.3110
  Area_winner_color_B: -0.3110
  winner_color_white: 0.3062
  winner_color_black: -0.3062
  Territory_winner_color_W: 0.3045
  Territory_winner_color_B: -0.3045
  Territory_score_of_opponent: 0.2136
  Territory_score: 0.1763
  Area_result: 0.1619

Principal Component 6:
  Territory_score: 0.3927
  Territory_score_of_opponent: 0.3471
  level_2: -0.3379
  result_resign: 0.2836
  result_score: -0.2836
  level_3: 0.2267
  Area score of opponent: 0.2035
  rules_Chinese: 0.1831
  Territory_winner_color_B: 0.1805
  Territory_winner_color_W: -0.1805

Principal Component 7:
  Area score of opponent: -0.3894
  Territory_score_of_opponent: -0.3578
  Player_mangochia: 0.3023
  rules_Chinese: 0.2473
  rules_Japanese: -0.2461
  Territory_result: 0.2363
  Area_result: 0.2332
  Territory_score: -0.2262
  handicap: 0.2142
  result_resign: 0.2113

Principal Component 8:
  Our_players_colour_B: -0.4948
  Our_players_colour_W: 0.4948
  level_2: -0.3096
  result_resign: -0.2420
  result_score: 0.2420
  step_count: -0.2403
  level_1: 0.2349
  winner_color_black: -0.1620
  winner_color_white: 0.1620
  rules_Japanese: -0.1210

Principal Component 9:
  color_white: -0.7071
  color_black: 0.7071
  result_resign: -0.0034
  result_score: 0.0034
  step_count: -0.0028
  Our_players_colour_B: -0.0023
  Our_players_colour_W: 0.0023
  Player_mangochia: -0.0023
  rules_Chinese: -0.0018
  rules_Japanese: 0.0015

Principal Component 10:
  winner_color_white: -0.3732
  winner_color_black: 0.3732
  Our_players_colour_B: -0.2926
  Our_players_colour_W: 0.2926
  result_resign: 0.2832
  result_score: -0.2832
  Area_winner_color_B: -0.1914
  Area_winner_color_W: 0.1914
  rules_Japanese: 0.1746
  Player_mangochia: -0.1572

In [25]:
# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components (Player Info)')
plt.grid(True)
plt.show()

# Plot the same explained variance ratio plot as before but only displaying the first 10 principal components
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance)[:10], marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by the first 10 Principal Components (Player Info)')
plt.xticks(ticks=range(10), labels=range(1, 10 + 1))  
plt.grid(True)
plt.show()
[Figure: Explained Variance by Principal Components (Player Info)]
[Figure: Explained Variance by the first 10 Principal Components (Player Info)]

PCA for every feature (both player info columns and board states)

In [26]:
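all_data = pd.concat([player_info_columns, board_state_columns], axis=1)
# Add the board state columns to the numerical features (without duplicates if the cell is re-run)
numerical_features = list(dict.fromkeys(numerical_features + list(board_state_columns.columns)))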
In [27]:
all_data.head()
Out[27]:
game_id move_id color winner_color winner_score result rules handicap starter_player step_count ... sj sk sl sm sn so sp sq sr ss
0 1 1 white white Time score Japanese 4 white 105 ... 0 0 0 0 0 0 0 0 0 0
1 1 2 black white Time score Japanese 4 white 105 ... 0 0 0 0 0 0 0 0 0 0
2 1 3 white white Time score Japanese 4 white 105 ... 0 0 0 0 0 0 0 0 0 0
3 1 4 black white Time score Japanese 4 white 105 ... 0 0 0 0 0 0 0 0 0 0
4 1 5 white white Time score Japanese 4 white 105 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 384 columns

In [28]:
# Preprocessing: One-Hot Encoding for categorical features and Standardization for numerical features

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(sparse_output=False), categorical_features)
    ]
)

# Create a pipeline for PCA
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', PCA())
])

# Apply PCA to all features (player info columns and board states)
all_data_pca = pipeline.fit_transform(all_data)

# Extract PCA component and explained variance
pca = pipeline.named_steps['pca']
explained_variance = pca.explained_variance_ratio_

# Get feature names from the preprocessor
# Note: OneHotEncoder generates multiple columns for each category, so we need to extract all feature names
onehot_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(onehot_feature_names)

# Analyze the PCA components to see the most important features
pca_components = pca.components_

# Print out the most important features for each principal component
print("\nMost important features for each principal component:\n")
for i, component in enumerate(pca_components[:10]):
    # Get the top features for this component, sorted by the absolute value of their loadings
    important_features = sorted(zip(all_feature_names, component), key=lambda x: abs(x[1]), reverse=True)
    print(f"Principal Component {i + 1}:")
    for feature, loading in important_features[:10]:  # Print top 10 features for each component
        print(f"  {feature}: {loading:.4f}")
    print()
Most important features for each principal component:

Principal Component 1:
  Territory_score_of_opponent: 0.4352
  Area score of opponent: 0.4163
  Area score: -0.4071
  Territory_result: 0.3611
  Area_result: 0.3607
  Territory_score: -0.3214
  level_1: 0.1535
  step_count: 0.1069
  level_3: -0.0992
  winner_color_white: -0.0780

Principal Component 2:
  handicap: -0.4760
  Territory_score: 0.3995
  Territory_result: 0.3978
  Area_result: 0.3957
  Area score: 0.3050
  Area score of opponent: -0.1660
  starter_player_black: 0.1346
  starter_player_white: -0.1346
  result_resign: -0.1192
  result_score: 0.1192

Principal Component 3:
  step_count: 0.6217
  Area score of opponent: 0.2453
  Area_winner_color_B: -0.2103
  Area_winner_color_W: 0.2103
  Territory_winner_color_B: -0.1943
  Territory_winner_color_W: 0.1943
  starter_player_black: 0.1703
  starter_player_white: -0.1703
  Player_mangochia: 0.1702
  rules_Japanese: -0.1672

Principal Component 4:
  handicap: -0.5060
  starter_player_black: 0.2927
  starter_player_white: -0.2927
  Territory_result: -0.2093
  Area_result: -0.2066
  Territory_score: -0.1981
  Area score: -0.1891
  step_count: -0.1805
  result_resign: 0.1403
  result_score: -0.1403

Principal Component 5:
  cs: 0.2743
  step_count: 0.2162
  Area_winner_color_W: -0.2052
  Area_winner_color_B: 0.2052
  Territory_winner_color_B: 0.2001
  Territory_winner_color_W: -0.2001
  winner_color_white: -0.1990
  winner_color_black: 0.1990
  Territory_score_of_opponent: -0.1468
  Territory_score: -0.1278

Principal Component 6:
  Territory_score: 0.1707
  la: -0.1652
  level_2: -0.1544
  fj: 0.1515
  Territory_score_of_opponent: 0.1491
  pp: 0.1246
  or: 0.1243
  result_resign: 0.1221
  result_score: -0.1221
  ff: -0.1164

Principal Component 7:
  color_black: -0.1715
  color_white: 0.1715
  qc: 0.1533
  dp: 0.1531
  me: 0.1478
  ag: 0.1372
  mc: 0.1272
  ds: 0.1237
  lb: 0.1192
  ii: 0.1173

Principal Component 8:
  ns: 0.3316
  ss: 0.1855
  rs: 0.1640
  aa: 0.1541
  ms: 0.1257
  is: 0.1252
  jg: -0.1211
  js: 0.1210
  gr: 0.1118
  qk: 0.1114

Principal Component 9:
  ag: -0.1653
  gs: 0.1578
  hs: 0.1562
  cs: -0.1557
  af: -0.1458
  iq: 0.1443
  bm: 0.1425
  mm: -0.1244
  is: 0.1240
  oq: 0.1195

Principal Component 10:
  qk: 0.1761
  la: -0.1503
  rp: -0.1480
  fp: 0.1460
  kl: -0.1366
  ds: 0.1300
  oi: -0.1254
  lb: -0.1189
  hd: -0.1184
  bc: -0.1146

In [29]:
# Plot the explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Principal Components (All data)')
plt.grid(True)
plt.show()

# Plot the same explained variance ratio plot as before but only displaying the first 10 principal components
plt.figure(figsize=(10, 6))
plt.plot(np.cumsum(explained_variance)[:10], marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by the first 10 Principal Components (All data)')
plt.xticks(ticks=range(10), labels=range(1, 10 + 1))  
plt.grid(True)
plt.show()
[Figure: Explained Variance by Principal Components (All data)]
[Figure: Explained Variance by the first 10 Principal Components (All data)]

Brief insights from PCA for all_data (this includes all features)

Principal Component 1: Dominated by the territory and area scores of both players and the resulting score differences, so it reflects how the scores shape the game's outcome and overall control of the board.

Principal Component 2: Driven by the handicap and the final scores, highlighting how an initial advantage or disadvantage affects the result.

Principal Component 3: Driven mostly by the number of moves (step_count), with smaller contributions from which color won, which player started, and the rule set.

Principal Component 4: Captures the handicap and which player starts, showing how these aspects shape the game's balance and scoring.

Principal Component 5: Mixes one board position (cs), the step count, and which color won, so it relates the winner's color to the course of the game.
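
The loading lists above can be read directly from a fitted PCA object. The following is a minimal sketch on toy data, not the notebook's actual cell: the helper name `top_loadings` and the three toy columns are assumptions for illustration, while `PCA.components_` is the scikit-learn attribute that holds one row of feature weights per component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def top_loadings(pca, feature_names, n_top=10):
    """Return, per component, the n_top features with the largest absolute loading."""
    names = np.asarray(feature_names)
    result = {}
    for i, component in enumerate(pca.components_):
        # Sort feature indices by absolute weight, largest first
        order = np.argsort(np.abs(component))[::-1][:n_top]
        result[f"PC{i + 1}"] = pd.Series(component[order], index=names[order])
    return result


# Toy data standing in for the encoded game features (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "step_count": rng.normal(200, 50, 300),
    "handicap": rng.integers(0, 5, 300).astype(float),
})
# Make one feature strongly correlated with step_count
toy["Territory_score"] = 0.1 * toy["step_count"] + rng.normal(0, 1, 300)

pca_toy = PCA().fit(StandardScaler().fit_transform(toy))
loadings = top_loadings(pca_toy, toy.columns, n_top=2)

for name, series in loadings.items():
    print(f"{name}:")
    print(series.round(4).to_string())
```

Note that the sign of a principal component is arbitrary, so the same analysis can print all loadings of one component with flipped signs on another run. Only the relative signs within a component carry meaning.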

In [30]:
player_info_pca
Out[30]:
array([[-2.56972002e-01, -1.13613815e+00,  2.21209921e+00, ...,
        -1.23202326e-15,  2.98885460e-16, -1.76560590e-17],
       [-2.57278066e-01, -1.13578791e+00,  2.20957852e+00, ...,
         1.81602177e-16, -2.05224192e-16,  9.55429946e-18],
       [-2.56972002e-01, -1.13613815e+00,  2.21209921e+00, ...,
         1.66896907e-16,  2.85589067e-16, -1.18001352e-18],
       ...,
       [ 2.90282725e+00, -2.53119759e+00,  1.88180756e+00, ...,
         9.63906110e-18, -3.57986474e-18,  3.69702945e-18],
       [ 2.90313332e+00, -2.53154785e+00,  1.88432841e+00, ...,
         4.10361198e-18,  1.75310587e-18, -2.98190384e-18],
       [ 2.90282725e+00, -2.53119759e+00,  1.88180756e+00, ...,
         4.81259742e-18, -1.42628565e-18,  1.91088259e-18]])
In [31]:
# pca_index corresponds to the entire index of player_info_columns
pca_index = player_info_columns.index

# Check if lengths match
print("Length of pca_index:", len(pca_index))
print("Length of player_info_pca:", len(player_info_pca))

# Create the pca_df DataFrame using the first 3 principal components
player_info_pca = pipeline.fit_transform(all_data)
pca_df = pd.DataFrame(player_info_pca[:, :3], columns=['PC1', 'PC2', 'PC3'])

# Use pca_index to get the correct 'level' values from player_info_columns
pca_df['level'] = player_info_columns.loc[pca_index, 'level'].values

# JUST SOME CHECKS
print("Checks begin here...")
print(player_info_columns['level'].unique())
print(player_info_columns['level'].value_counts())
print(pca_df['level'].unique())
print("Checks end here.")

# Plotting PC1 vs PC2
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='level', palette='viridis')
plt.title('PC1 vs PC2')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Player level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

# Plotting PC1 vs PC3
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC3', hue='level', palette='viridis')
plt.title('PC1 vs PC3')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 3')
plt.legend(title='Player level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

# Plotting PC2 vs PC3
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PC2', y='PC3', hue='level', palette='viridis')
plt.title('PC2 vs PC3')
plt.xlabel('Principal Component 2')
plt.ylabel('Principal Component 3')
plt.legend(title='Player level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

# 3D Plot of PC1, PC2, and PC3
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(pca_df['PC1'], pca_df['PC2'], pca_df['PC3'], c=pca_df['level'], cmap='viridis')
ax.set_title('3D Plot of PC1, PC2, and PC3')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
# No plt.legend() here: the scatter has no labeled artists, so the colorbar below acts as the legend
plt.colorbar(sc, label='Level')
plt.show()
Length of pca_index: 9755
Length of player_info_pca: 9755
Checks begin here...
[1 2 3]
level
2    5677
1    2434
3    1644
Name: count, dtype: int64
[1 2 3]
Checks end here.
[Figure: PC1 vs PC2, colored by player level]
[Figure: PC1 vs PC3, colored by player level]
[Figure: PC2 vs PC3, colored by player level]
[Figure: 3D Plot of PC1, PC2, and PC3, colored by player level]
In [32]:
fig = px.scatter_3d(pca_df, x='PC1', y='PC2', z='PC3', color='level')
fig.show()

Comments¶

  • Which features did you use? Why?

    • first 3 Principal components
    • level of the player (beginner / intermediate / master)

We used the first 3 principal components. The ten features with the largest loadings for each of them are:

Principal Component 1:
  Territory_score_of_opponent: 0.4352
  Area score of opponent: 0.4163
  Area score: -0.4071
  Territory_result: 0.3611
  Area_result: 0.3607
  Territory_score: -0.3214
  level_1: 0.1535
  step_count: 0.1069
  level_3: -0.0992
  winner_color_black: 0.0780

Principal Component 2:
  handicap: 0.4760
  Territory_score: -0.3995
  Territory_result: -0.3978
  Area_result: -0.3957
  Area score: -0.3050
  Area score of opponent: 0.1660
  starter_player_white: 0.1346
  starter_player_black: -0.1346
  result_resign: 0.1192
  result_score: -0.1192

Principal Component 3:
  step_count: 0.6217
  Area score of opponent: 0.2453
  Area_winner_color_B: -0.2103
  Area_winner_color_W: 0.2103
  Territory_winner_color_B: -0.1943
  Territory_winner_color_W: 0.1943
  starter_player_white: -0.1703
  starter_player_black: 0.1703
  Player_mangochia: 0.1702
  rules_Japanese: -0.1672

  • Which projection methods did you use? Why?

    • We also tried t-SNE and UMAP, but PCA produced the clearest and most insightful plots
    • The cumulative explained variance plots show that the first few principal components account for a large share of the variance in the data
  • Why did you choose these hyperparameters?

    • When experimenting with t-SNE, we ran a grid search over the hyperparameters
    • With PCA, we kept the default settings and used the first 3 components, after printing which principal components explain the data best
  • Are there patterns in the global and the local structure?

    • Players of similar levels follow a similar approach
      • The different player levels can be clearly separated from each other
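
The grid search mentioned above is not shown in this part of the notebook. A minimal sketch of how such a search could look follows. Everything here is an assumption for illustration: the toy blobs stand in for the board states, the parameter values are arbitrary, and scoring each embedding with scikit-learn's `trustworthiness` is one possible criterion, not necessarily the one we used.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness
from sklearn.model_selection import ParameterGrid

# Toy high-dimensional data standing in for the board states (hypothetical)
X, _ = make_blobs(n_samples=120, n_features=10, centers=3, random_state=0)

# Candidate hyperparameters (illustrative values only)
grid = ParameterGrid({"perplexity": [5, 15, 30], "learning_rate": [100.0, 200.0]})

scores = []
for params in grid:
    embedding = TSNE(n_components=2, init="pca", random_state=0, **params).fit_transform(X)
    # Trustworthiness in [0, 1]: how well local neighborhoods are preserved
    scores.append((trustworthiness(X, embedding, n_neighbors=5), params))

best_score, best_params = max(scores, key=lambda item: item[0])
print(f"best trustworthiness: {best_score:.3f} with {best_params}")
```

A numeric score only narrows the candidates. The final choice should still be made by looking at the resulting plots.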

Meta Data Encoding¶

Encode additional features in the visualization.

Use features of the source data and include them in the projection, e.g., by using color, opacity, different shapes, or line styles, etc.

In [33]:
# 1. PCA Implementation
# ---------------------------------
"""pca = PCA(n_components=3)  # We are interested in the first 3 components
pca_components = pca.fit_transform(player_info_columns[numerical_features])

# Create a DataFrame for easy handling of PCA components
pca_df = pd.DataFrame(pca_components, columns=['PC1', 'PC2', 'PC3'])"""


# Optionally, add meta-data features for encoding
pca_df['level'] = player_info_columns['level'].values
pca_df['winner_color'] = player_info_columns['winner_color'].values
pca_df['result'] = player_info_columns['result'].values
pca_df['starter_player'] = player_info_columns['starter_player'].values
pca_df['Area_winner_color'] = player_info_columns['Area_winner_color'].values
pca_df['Territory_winner_color'] = player_info_columns['Territory_winner_color'].values
pca_df['color'] = player_info_columns['color'].values
pca_df['Our_players_colour'] = player_info_columns['Our_players_colour'].values


# ---------------------------------
# 2. PCA Plots with Meta-Data Encoding
# ---------------------------------
# PC1 vs PC2
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='level', style='result', palette='viridis')
plt.title('PC1 vs PC2 with Level and Result Encoding')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Level / result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='Area_winner_color', style='Territory_winner_color', palette='viridis')
plt.title('PC1 vs PC2 with Area_winner_color and Territory_winner_color')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Area_winner_color / Territory_winner_color', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='winner_color', style='Our_players_colour', palette='viridis')
plt.title('PC1 vs PC2 with winner_color and Our player colour')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='winner_color / color', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='result', style='Our_players_colour', palette='viridis')
plt.title('PC1 vs PC2 with result and Our player colour')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='result / color', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

# PC1 vs PC3
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC3', hue='result', style='level', palette='coolwarm')
plt.title('PCA: PC1 vs PC3 with Result and Level Encoding')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 3')
plt.legend(title='Result / Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

# PC2 vs PC3
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC2', y='PC3', hue='winner_color', style='result', palette='Set2')
plt.title('PCA: PC2 vs PC3 with Winner Color and Result Encoding')
plt.xlabel('Principal Component 2')
plt.ylabel('Principal Component 3')
plt.legend(title='Winner Color / Result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
[Figure: PC1 vs PC2, hue = level, style = result]
[Figure: PC1 vs PC2, hue = Area_winner_color, style = Territory_winner_color]
[Figure: PC1 vs PC2, hue = winner_color, style = Our_players_colour]
[Figure: PC1 vs PC2, hue = result, style = Our_players_colour]
[Figure: PC1 vs PC3, hue = result, style = level]
[Figure: PC2 vs PC3, hue = winner_color, style = result]
In [34]:
def determine_phase(df):
    # Group by game_id
    df['phase'] = df.groupby('game_id')['move_id'].transform(
        lambda x: ["starter" if step <= 20 
                   else "final" if step > (x.max() - 20) 
                   else "intermediate"
                   for step in x]
    )
    return df

# Apply the function to the DataFrame
player_info_columns = determine_phase(player_info_columns)
/tmp/ipykernel_8097/1667704383.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
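
The SettingWithCopyWarning above appears because `player_info_columns` is itself a slice of another DataFrame and the function writes a new column into it. A minimal sketch of a variant that works on an explicit copy and computes the phase in a vectorized way is given below. The name `determine_phase_copy`, the `window` parameter and the toy frame are assumptions for illustration. The thresholds mirror the cell above (first 20 moves, last 20 moves).

```python
import numpy as np
import pandas as pd


def determine_phase_copy(df, window=20):
    """Return a copy of df with a 'phase' column: starter / intermediate / final."""
    out = df.copy()  # explicit copy, so no slice of another frame is modified
    last_move = out.groupby("game_id")["move_id"].transform("max")
    # np.select takes the first matching condition, so "starter" wins over "final"
    out["phase"] = np.select(
        [out["move_id"] <= window, out["move_id"] > last_move - window],
        ["starter", "final"],
        default="intermediate",
    )
    return out


# Toy data: one game with 50 moves and one with 30 moves (hypothetical)
toy = pd.DataFrame({
    "game_id": [1] * 50 + [2] * 30,
    "move_id": list(range(1, 51)) + list(range(1, 31)),
})
phases = determine_phase_copy(toy)
print(phases.groupby(["game_id", "phase"]).size())
```

For games with fewer than 40 moves the two windows overlap, and, as in the original function, the first 20 moves are still labeled "starter".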

In [35]:
pca_df['phase'] = player_info_columns['phase'].values
pca_df['Player']=player_info_columns['Player'].values
In [36]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='winner_color', style='result', alpha=0.7, palette='viridis')
plt.title('PCA: PC1 vs PC2 with winner_color and result')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='winner_color / result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
[Figure: PCA: PC1 vs PC2 with winner_color and result]

Here we can see that when White wins, in most cases the opponent resigns. When Black wins, the game is usually decided by score rather than by resignation.

(Black plays first unless given a handicap of two or more stones, in which case White plays first)

In [37]:
plt.figure(figsize=(10, 6))
# Attach step_count positionally (as for the other metadata columns) so it stays aligned after filtering
plot_df = pca_df.assign(step_count=player_info_columns['step_count'].values).query('Player != "mangochia"')
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='step_count', style='result', alpha=0.7, palette='viridis')
plt.title('PCA: PC1 vs PC2 with number of all steps and result')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='step_count / result', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
[Figure: PCA: PC1 vs PC2 with number of all steps and result]

We can see that games with many steps (roughly above 360) usually end without a resignation: the players continue until both of them pass.

In [38]:
plt.figure(figsize=(10, 6))
# Attach step_count positionally (as for the other metadata columns) so it stays aligned after filtering
plot_df = pca_df.assign(step_count=player_info_columns['step_count'].values).query('Player != "mangochia"')
sns.scatterplot(data=plot_df, x='PC1', y='PC2', hue='step_count', style='level', alpha=0.7, palette='viridis')
plt.title('PCA: PC1 vs PC2 with number of all steps and level')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='step_count / level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
[Figure: PC1 vs PC2, hue = step_count, style = level]

We can see here that master players usually have games with far fewer steps.

Comments¶

  • Which features did you use? Why?

    • level
    • result
    • Area_winner_color
    • Territory_winner_color
    • winner_color
    • Our_players_colour
    • phase (starter, intermediate and final steps)
    • step_count

    These were the features whose projection plots showed useful insights and clear clusters, from which we could draw the following conclusions (some examples):

    • When White wins, in most cases the opponent resigns. When Black wins, the game is usually decided by score rather than by resignation.
    • Games with many steps (roughly above 360) usually end without a resignation: the players continue until both of them pass.
    • Master players usually have games with far fewer steps.

    As for the Link States plots, see the Link States section below.

  • How are the features encoded?

    • We built a pipeline in which the features are separated into numerical and categorical features
    • We used StandardScaler for the numerical features
    • We used one-hot encoding for the categorical features

Link States¶

Connect the states that belong together.

The states of a single solution should be connected to see the path from the start to the end state. How the points are connected is up to you, for example, with straight lines or splines.

In [39]:
# Apply PCA
pca = PCA(n_components=2)
pca_board_states = pca.fit_transform(board_state_scaled)

# Combine PCA results with game_data
game_data_pca = all_data.copy()
game_data_pca[['PC1', 'PC2']] = pca_board_states

# Ensure move_id is numeric before it is compared against the phase thresholds
game_data_pca['move_id'] = pd.to_numeric(game_data_pca['move_id'], errors='coerce')

# Determine the phase and add to `game_data_pca`
game_data_pca = determine_phase(game_data_pca)

# Filter and sort by move_id
filtered_pca_GO_1 = game_data_pca[game_data_pca['game_id'] == 31].sort_values('move_id')
filtered_pca_GO_2 = game_data_pca[game_data_pca['game_id'] == 23].sort_values('move_id')

# Visualization with lines and nodes
lines_1 = alt.Chart(filtered_pca_GO_1).mark_line(
    opacity=0.3,
    strokeWidth=1.5
).encode(
    x='PC1',
    y='PC2',
    color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red'])),
    order='move_id:Q'
)

nodes_1 = alt.Chart(filtered_pca_GO_1).mark_circle(size=30, opacity=0.6).encode(
    x='PC1',
    y='PC2',
    color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red']))
)
# Visualization with lines and nodes
lines_2 = alt.Chart(filtered_pca_GO_2).mark_line(
    opacity=0.3,
    strokeWidth=1.5
).encode(
    x='PC1',
    y='PC2',
    color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red'])),
    order='move_id:Q'
)

nodes_2 = alt.Chart(filtered_pca_GO_2).mark_circle(size=30, opacity=0.6).encode(
    x='PC1',
    y='PC2',
    color=alt.Color('phase:N', scale=alt.Scale(domain=['starter', 'intermediate', 'final'], range=['blue', 'green', 'red']))
)


# Combine lines and nodes
path_chart_with_nodes_1 = (lines_1 + nodes_1).properties(
    width=500,
    height=500,
    title="Paths of Beginner Player"
).interactive()
path_chart_with_nodes_2 = (lines_2 + nodes_2).properties(
    width=500,
    height=500,
    title="Paths of Master Player"
).interactive()

path_chart_with_nodes_1 | path_chart_with_nodes_2
Out[39]:

This analysis visualizes the progression of moves for a master player (game_id = 23) and a beginner player (game_id = 31) in a Go game. The data has been processed using Principal Component Analysis (PCA) to reduce the dimensionality of the game state data, enabling a 2D visualization of each player's moves.

  • Color Coding by Game Phase: The moves are color-coded based on the phase of the game:

    • Blue: Opening phase (starter)
    • Green: Middle game phase (intermediate)
    • Red: Endgame phase (final)
  • Line and Node Visualization:

    • Lines represent the sequential progression of moves, ordered by move_id.
    • Nodes represent individual moves at each point in the sequence.

By displaying both visualizations side by side, we can compare the strategic choices and decision-making patterns of the master player versus the beginner player.

We can see from the plot that the master player's game has significantly fewer moves.
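
The visual comparison can be backed by simple numbers per path. The sketch below computes, for each game, the number of moves and the total length of its path in the projected plane. It is an illustration under assumptions: the helper `path_summary` is hypothetical, and the toy coordinates merely reuse the game ids 23 and 31. They are not the real projection values.

```python
import numpy as np
import pandas as pd


def path_summary(df):
    """Per game: number of moves and total path length in the (PC1, PC2) plane."""
    ordered = df.sort_values(["game_id", "move_id"])
    # Difference to the previous state of the same game; NaN for each first move
    steps = ordered.groupby("game_id")[["PC1", "PC2"]].diff()
    step_len = np.hypot(steps["PC1"], steps["PC2"]).fillna(0.0)
    return pd.DataFrame({
        "n_moves": ordered.groupby("game_id")["move_id"].count(),
        "path_length": step_len.groupby(ordered["game_id"]).sum(),
    })


# Toy projected states (hypothetical coordinates)
toy = pd.DataFrame({
    "game_id": [23, 23, 23, 31, 31, 31, 31],
    "move_id": [1, 2, 3, 1, 2, 3, 4],
    "PC1": [0.0, 3.0, 3.0, 0.0, 1.0, 1.0, 0.0],
    "PC2": [0.0, 4.0, 4.0, 0.0, 0.0, 1.0, 1.0],
})
summary = path_summary(toy)
print(summary)
```

Path length depends on the projection, so it should only be compared between games embedded with the same fitted PCA.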

In [40]:
player_info_columns['Date'] = player_info_columns['Date'].astype(str)

# t-SNE Projection
tsne = TSNE(n_components=2, perplexity=30, n_iter=300)
tsne_result = tsne.fit_transform(board_state_columns)
player_info_columns['tSNE1'] = tsne_result[:, 0]
player_info_columns['tSNE2'] = tsne_result[:, 1]

chart1 = alt.Chart(player_info_columns.query('move_id > 250 and level == 1')).mark_line(
    opacity=0.6,
    point=alt.MarkConfig(size=50)
).encode(
    x='move_id:Q',
    y='tSNE2:Q',
    color='game_id',
    detail='game_id:N',
    order='move_id:Q',
    tooltip=['game_id', 'move_id', 'winner_color']
).properties(
    width=700,
    height=400,
    title="Move Trajectories by Game - End - Level 1"
).interactive()

chart2 = alt.Chart(player_info_columns.query('move_id > 300 and level == 2')).mark_line(
    opacity=0.6,
    point=alt.MarkConfig(size=50)
).encode(
    x='move_id:Q',
    y='tSNE2:Q',
    color='game_id',
    detail='game_id:N',
    order='move_id:Q',
    tooltip=['game_id', 'move_id', 'winner_color']
).properties(
    width=700,
    height=400,
    title="Move Trajectories by Game - End - Level 2"
).interactive()

chart3 = alt.Chart(player_info_columns.query('move_id > 300 and level == 3')).mark_line(
    opacity=0.6,
    point=alt.MarkConfig(size=50)
).encode(
    x='move_id:Q',
    y='tSNE2:Q',
    color='game_id',
    detail='game_id:N',
    order='move_id:Q',
    tooltip=['game_id', 'move_id', 'winner_color']
).properties(
    width=700,
    height=400,
    title="Move Trajectories by Game - End - Level 3"
).interactive()

alt.concat(chart1, chart2, chart3, columns=3).resolve_scale(color='independent')
/tmp/ipykernel_8097/1591916745.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/tmp/ipykernel_8097/1591916745.py:6: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/tmp/ipykernel_8097/1591916745.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[40]:

The plots above show the last steps of some of the games, with each plot showing only games from one level. Different colors indicate different game_ids, i.e. different games. Most games end at the same point, at value 0 on the y axis, which indicates a common end strategy. One of the level 1 games does not follow this pattern, which may be explained by the different game strategies of beginner players.

In [41]:
# PCA
pca_df['move_id'] = player_info_columns['move_id'].values
pca_df['game_id'] = player_info_columns['game_id'].values


player_info_columns['Date'] = player_info_columns['Date'].astype(str)
alt.Chart(pca_df.query('move_id < 10')).mark_line(
    opacity=0.6,
    point=alt.MarkConfig(size=50)
).encode(
    x='move_id:Q',
    y='PC1:Q', 
    color='level:N',
    detail='game_id:N',
    order='move_id:Q',
    tooltip=['game_id', 'move_id', 'winner_color']
).properties(
    width=700,
    height=400,
    title="Move Trajectories by Level using PCA"
).interactive()
/tmp/ipykernel_8097/2616770832.py:6: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[41]:

The graph above shows the first ten steps of the investigated Go games, with the colors indicating the level of each game. The areas in the plot can be divided approximately according to the three levels, which suggests that players of the same level follow similar approaches.

In [42]:
alt.Chart(player_info_columns).mark_rect().encode(
    x=alt.X('game_id:O', title='Game ID', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('level:O', title='Level'),
    color=alt.Color('count(move_id):Q', scale=alt.Scale(scheme='viridis')),
    tooltip=['game_id', 'level', 'count(move_id)']
).properties(
    width=700,
    height=400,
    title="Heatmap of Move Counts by Game ID and Level"
).interactive() 
Out[42]:

The heatmap above shows the number of moves of the individual games, arranged by level and game. Apart from individual outliers, games of levels one and three had fewer moves, i.e. they were shorter, than games of the intermediate level. A possible explanation: beginners lack good strategies and therefore lose quite quickly, while experts have good winning strategies and therefore win quickly. Intermediate players have strategies that keep them from losing immediately, but they do not win immediately either.
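
The same comparison can be read off a small table instead of a heatmap. The sketch below counts the moves per game and then summarizes them per level. The toy frame is hypothetical and only mimics the game_id, level and move_id columns of `player_info_columns`.

```python
import pandas as pd

# Toy move table: one row per move (hypothetical values)
toy = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "level":   [1, 1, 1, 2, 2, 2, 2, 2, 2],
    "move_id": [1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# Number of moves per game, keeping the level as part of the index
moves_per_game = toy.groupby(["level", "game_id"])["move_id"].count()

# Summary per level: number of games and spread of game lengths
summary = moves_per_game.groupby(level="level").agg(["count", "median", "min", "max"])
print(summary)
```

Using the median makes the comparison less sensitive to the individual outliers mentioned above.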

Optional¶

Projection Space Explorer (click to reveal)

Projection Space Explorer

The Projection Space Explorer is a web application to plot and connect two-dimensional points. Metadata of the data points can be used to encode additional information into the projection, e.g., by using different shapes or colors.

Further Information:

  • Paper: https://jku-vds-lab.at/publications/2020_tiis_pathexplorer/
  • Repo: https://github.com/jku-vds-lab/projection-space-explorer/
  • Application Overview: https://jku-vds-lab.at/pse/

Data Format

How to format the data can be found in the Projection Space Explorer's README.

Example data with three lines, with two colors (algo) and additional mark encoding (cp):

x y line cp algo
0.0 0 0 start 1
2.0 1 0 state 1
4.0 4 0 state 1
6.0 1 0 state 1
8.0 0 0 state 1
12.0 0 0 end 1
-1.0 10 1 start 2
0.5 5 1 state 2
2.0 3 1 state 2
3.5 0 1 state 2
5.0 3 1 state 2
6.5 5 1 state 2
8.0 10 1 end 2
3.0 6 2 start 2
2.0 7 2 end 2

Save the dataset to CSV, e.g. using pandas: df.to_csv('data_path_explorer.csv', encoding='utf-8', index=False)
and upload it in the Projection Space Explorer by clicking on OPEN FILE in the top left corner.
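
A projection with one row per state can be brought into this layout as sketched below. This is a minimal example under assumptions: the helper `to_pse_frame` and the toy frame are hypothetical, and the meaning of the columns (line as path id, cp as start/state/end marker, algo as color group) is inferred from the example table above, so the README remains the reference for the exact format.

```python
import io

import pandas as pd


def to_pse_frame(df, x="PC1", y="PC2", line="game_id", order="move_id", algo="level"):
    """Build a frame with the columns x, y, line, cp, algo from projected states."""
    ordered = df.sort_values([line, order])
    position = ordered.groupby(line).cumcount()           # 0-based position within each path
    size = ordered.groupby(line)[order].transform("size")  # number of states in each path
    cp = pd.Series("state", index=ordered.index)
    cp[position == 0] = "start"
    cp[position == size - 1] = "end"
    return pd.DataFrame({
        "x": ordered[x],
        "y": ordered[y],
        "line": ordered[line],
        "cp": cp,
        "algo": ordered[algo],
    }).reset_index(drop=True)


# Toy projected states for two games (hypothetical values)
toy = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2],
    "move_id": [1, 2, 3, 1, 2],
    "level": [1, 1, 1, 3, 3],
    "PC1": [0.0, 2.0, 4.0, -1.0, 0.5],
    "PC2": [0.0, 1.0, 4.0, 10.0, 5.0],
})
pse = to_pse_frame(toy)

# Written to an in-memory buffer here; pass a file path instead to create the CSV
buffer = io.StringIO()
pse.to_csv(buffer, index=False)
print(buffer.getvalue())
```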

ℹ You can also include your high-dimensional data and use it to adapt the visualization.

Results¶

You may add additional screenshots of the Projection Space Explorer.

Interpretation¶

What can be seen in the projection(s)?

  • Players of the same level have similar approaches
  • Players of different levels have different approaches (master player has different approach than beginner player)
  • On average, master games are shorter (contain fewer steps) than beginner games
  • Beginner games tend to include more abrupt phase transitions (for example tend to end games more abruptly)
  • Ending states are not so diverse (Most games end at the same point)
  • There are some games where different players won based on Area and Territory scores
  • Most master-level games end with a resignation. Beginner games show the same tendency, but intermediate games do not necessarily

Was it what you expected? If not what did you expect?

Mostly yes, but there were some exceptions:

  • Ending states are not so diverse → we expected to identify more diverse ending states
  • For both master and beginner games it is a tendency to end the game with the opponent resigning → we only expected to see this for master games

Can you confirm prior hypotheses from the projection?

Yes, for example these ones:

  • Different approaches by the expertise level of the players: master players display more sophisticated strategies
  • On average, master games are shorter (contain fewer steps) than beginner games

Did you get any unexpected insights?

  • Players of the same level have similar approaches
  • Ending states are not so diverse (Most games end at the same point)
  • Most master-level games end with a resignation. Beginner games show the same tendency, but intermediate games do not necessarily

Submission¶

When you’ve finished working on this assignment please download this notebook as HTML and add it to your repository in addition to the notebook file.